An Enhanced Approach to Handle Missing Values in Heterogeneous Dataset

Authors

  • Yongsong Qin
  • Shichao Zhang
  • Xiaofeng Zhu
  • Jilian Zhang
Abstract

Data mining (sometimes called data or knowledge discovery, or knowledge extraction) is the process of analyzing large volumes of data from different perspectives and summarizing them into useful information. Data quality is therefore essential for obtaining high-quality patterns: quality decisions ought to be based on quality data. Data quality suffers when values are missing for various reasons, leaving so-called holes in the data. A variety of imputation methods have been developed to complete a database by filling these holes with plausible values, but they are limited to handling missing values in homogeneous attributes. Few existing systems use a mixture kernel function to impute missing values in mixed-attribute datasets. This work proposes a new imputation framework for handling missing values in heterogeneous datasets. Pre-imputation is first performed using the ENI (Encapsidated Neighbour Imputation) approach, and a Gaussian kernel function is then applied to both continuous and discrete attributes. The proposed framework is tested against its competitors at various standard missing rates on benchmark datasets from the UCI repository. Its behaviour is studied using the root mean square error (RMSE), and the results indicate that it performs well.

Keywords: Data Mining, Missing Value Imputation, Kernel
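The two-stage idea described in the abstract (pre-imputation, then Gaussian-kernel estimation over mixed attributes, scored by RMSE) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not define ENI, so a simple mean/mode fill stands in for it here. The function names `pre_impute`, `kernel_impute`, and `rmse`, the bandwidth `h`, and the numeric coding of discrete attributes are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(u):
    # Unnormalized Gaussian kernel; the constant cancels in weighted averages.
    return np.exp(-0.5 * u ** 2)

def pre_impute(X, is_discrete):
    # Placeholder for the ENI step (not specified in the abstract):
    # fill each hole with the column mode (discrete) or mean (continuous).
    X = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        observed = X[~missing, j]
        if is_discrete[j]:
            values, counts = np.unique(observed, return_counts=True)
            fill = values[np.argmax(counts)]
        else:
            fill = observed.mean()
        X[missing, j] = fill
    return X

def kernel_impute(X, is_discrete, h=1.0):
    # X: float array with np.nan for missing cells; discrete columns are
    # assumed to be numerically coded. h is a hypothetical bandwidth.
    mask = np.isnan(X)
    filled = pre_impute(X, is_discrete)
    std = filled.std(axis=0)
    std[std == 0] = 1.0
    Z = (filled - filled.mean(axis=0)) / std  # standardized, pre-imputed data
    result = filled.copy()
    d = X.shape[1]
    for i, j in zip(*np.nonzero(mask)):
        donors = ~mask[:, j]                  # rows where attribute j is observed
        others = np.arange(d) != j            # every attribute except the target
        diff = (Z[donors][:, others] - Z[i, others]) / h
        w = gaussian_kernel(diff).prod(axis=1)  # product kernel over attributes
        if w.sum() == 0:
            continue                          # keep the pre-imputed value
        y = X[donors, j]
        if is_discrete[j]:
            # Discrete target: pick the category with the largest total weight.
            values = np.unique(y)
            scores = [w[y == v].sum() for v in values]
            result[i, j] = values[int(np.argmax(scores))]
        else:
            # Continuous target: kernel-weighted average of observed values.
            result[i, j] = np.dot(w, y) / w.sum()
    return result

def rmse(true_values, imputed_values):
    # Root mean square error between true and imputed values.
    t = np.asarray(true_values, dtype=float)
    p = np.asarray(imputed_values, dtype=float)
    return float(np.sqrt(np.mean((t - p) ** 2)))
```

In an evaluation of the kind the abstract mentions, one would typically hide a known fraction of values in a complete dataset, call `kernel_impute` on the result, and pass the hidden and imputed values of the continuous attributes to `rmse`.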

Similar resources

Investigating the missing data effect on credit scoring rule based models: The case of an Iranian bank

Credit risk management is a process in which banks estimate the probability of default (PD) for each loan applicant. Data sets of previous loan applicants are built by gathering their data, and these internal data sets are usually completed using external credit bureau’s data and finally used for estimating PD in banks. There is also a continuous interest for banks to use rule based classifiers to b...

A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset

Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of the electrocardiogram signal classification and segmentation methodologies indicates that algorithms which are able to effectively handle the nonstationarity and uncertainty of the signals should be used for ECG analysis. Ex...

Multiple imputation for national public-use datasets and its possible application for gestational age in United States Natality files.

Multiple imputation (MI) is a technique that can be used for handling missing data in a public-use dataset. With MI, two or more completed versions of the dataset are created, containing possibly different but reasonable replacements for the missing data. Users analyse the completed datasets separately with standard techniques and then combine the results using simple formulae in a way that all...

A Comparative Study on Decision Rule Induction for incomplete data using Rough Set and Random Tree Approaches

Handling missing attribute values is the most challenging process in data analysis. Many approaches can be adopted to handle missing attributes. In this paper, a comparative analysis is made of an incomplete dataset for future prediction using the rough set approach and random tree generation in data mining. The result of simple classification technique (using random tree ...

A BAYESIAN APPROACH TO COMPUTING MISSING REGRESSOR VALUES

In this article, Lindley's measure of average information is used to measure the information contained in incomplete observations on the vector of unknown regression coefficients [9]. This measure of information may be used to compute the missing regressor values.

Publication date: 2014